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ABSTRACT 

This  work  investigates  the  application  of  evolutionary  programming,  a  multi-agent  stochastic  search  technique,  to 
the  generation  of  recurrent  perceptrons  (nonlinear  HR  filters)  for  time-series  prediction  tasks.  The  evolutionary 
programming  paradigm  is  discussed  and  analogies  are  made  to  classical  stochastic  optimization  methods.  A  hybrid 
optimization  scheme  is  proposed  based  on  multi-agent  and  single-agent  random  optimization  techniques.  This  method  is 
then  used  to  determine  both  the  model  order  and  weight  coefficients  of  linear,  nonlinear,  and  parallel  linear-nonlinear  next- 
step  predictors.  The  AIC  is  used  as  the  cost  function  to  score  each  candidate  solution. 

1.  INTRODUCTION 

Networks  with  recurrent  connections  represent  an  alternative  to  feedforward  networks  for  predicting  and 
classifying  nonlinear  time-series  data.  Recurrent  networks  can  be  transformed  to  purely  feedforward  networks  by  unfolding 
in  time,  where  additional  layers  correspond  to  each  time  step.  This  technique  may  be  prohibitive  if  the  time  structure  is  not 
well  known  or  may  become  very  complicated  if  the  topology  is  dynamic.  As  a  result,  training  can  be  greatly  complicated  by 
the  addition  of  recurrent  units  in  a  dynamic  artificial  neural  network.  This  work  investigates  the  application  of  multi-agent 
stochastic  search  in  determining  the  number  of  recurrencies  as  well  as  the  weight  coefficients  of  a  single  perceptron.  A 
recurrent  perceptron  can  be  used  as  a  fundamental  node  in  recurrent  network.  The  "perceptron"  in  this  case  simply  refers  to 
an  infinite  impulse  response  (IIRl  filter  with  potentially  nonlinear  outputs.  The  application  domain  in  this  study  is 
constrained  to  next-step  prediction  problems. 

Simultaneously  determinin':  both  perceptron  weight  coefficients  and  structure  requires  a  search  procedure  that  is 
amenable  to  combinatorial  optimization  problems.  The  more  successful  algorithms  for  these  types  of  problems  have 
generally  been  stochastic  search  techniques  such  as  simulated  annealing  ^  genetic  algorithms^,  and  simulated  evolution^ 
The  simulated  evolution,  or  evolutionary  programming  (EP),  paradigm  has  been  shown  to  have  the  desired  atuibutes: 
combinatorial  optimization  capabilities^,  the  ability  to  det^mine  model  structure^,  and  the  ability  to  train  neural  networks®. 

Crick  and  Asanuma^  indicate  that  recurrency  is  the  "general  rule"  as  reciprocal  connections  are  generated  for 
projections  from  one  cortical  area  to  another  cortical  area.  In  partially  recurrent  artificial  neural  networks,  the  units  which 
receive  feedback  are  generally  referred  to  as  "context"  units*  since  they  provide  information  about  past  activation  levels  of 
the  output  or  the  hidden  units.  Williams^  characterizes  recunency  based  on  its  utilization  in  a  connectionist  architecture. 
Conservative  recurrence  corresponds  to  a  tapped-delay  input  signal.  This  approach  yields  a  network  which  is  sensitive  to 
temporal  patterns  without  directly  incorporating  recurrent  units.  This  technique  has  been  widely  applied  in  the  field  of 
speech  recognition  (c./.,  Waibel  et  Liberal  recurrence  is  the  feedback  from  the  output  to  the  input  units  such  as 

implemented  by  Jordan’  ’ .  Radical  recurrence  encompasses  both  conservative  and  liberal  recurrency  as  well  as  arbitrary 
recurrency  among  the  hidden  units.  Elman's'^  architecture  is  representative  of  radical  recurrency. 

This  work  applies  recurrency  to  a  single  processing  unit  similar  to  the  adaptive  linear  combiners  di.scussed  by 
Widrow  and  Steams’^.  A  significant  difference  is  that  nonlinearities  are  imposed  on  the  output  of  the  linear  combiners.  If 
a  hyperbolic  tangent  is  used  as  the  nonlinearity,  then  the  linear  combiner  becomes  a  recurrent  percepuon  or  an  HR  filter 
with  nonlinear  output  transformations.  This  investigation  takes  advantage  of  the  EP  framework  to  evolve  the  number  of 
tapped-delay  inputs  as  well  as  the  number  of  tapped-delay  feedbacks.  Determining  the  model  order  of  ARM.^  processes 
using  EP  was  done  by  FogeP.  Recurrent  .structures  have  been  successfully  trained  using  EP  by  Angeline  et  al.''*. 
Saravanan’®,  and  McDonnell  and  Waagen'®.  Gradient  methods  for  training  recurrent  nets  are  given  by  Rumelhan  et  al.'^ 
and  Williams  and  Zipser'*. 


The  application  domain  invesugated  in  this  work  is  ume-series  prediction,  l  eedtorward  networks  have  been  used 
with  success  for  both  system  modeling  and  prediction  .  Narendra  and  Parthasarathy‘‘^  used  feedforward  networks  for 
system  identification  and  control.  Jones  et  al?^  have  formulated  the  Connectionist  Normalized  Linear  Spline  Network 
(CNLS)  for  time-series  prediction.  The  CNLS  is  a  feedforward  network  that  incorporates  Gaussian  activations  on  the 
hidden  units  and  normalizes  the  basis  functions  during  training.  Weigend  et  use  weight-elimination  on  feedforward 
networks  to  prevent  overfitting  the  training  data  for  a  prediction  task.  Hinton  et  have  demonstrated  the  use  of  weight¬ 
sharing  for  training  feedforward  networks  with  prediction  capabilities. 

The  next  section  describes  Solis&Wets'  random  optimization  technique  and  the  general  evolutionary  programing 
algorithm.  Variants  of  both  methods  are  applied  to  finding  extrema  of  an  unknown  function.  A  hybrid  strategy  is 
subsequently  developed  which  embeds  Solis&Wets'  technique  within  the  EP  framework.  Results  of  the  hybrid  approach  for 
function  optimization  are  also  provided.  The  structure  of  the  recurrent  perceptron  is  then  discussed.  Finally,  results  are 
given  for  next-step  prediction  performance  on  some  standard  time-series  benchmarks. 

2.  MULTI-AGENT  STOCHASTIC  SEARCH 

2.1.  Single- Agent  Stochastic  Search 

Random  optimization  has  Uaditionally  been  based  on  single-agent  stochastic  search  (SASS)  strategies.  Kamop'^^ 
discusses  the  benefits  of  using  SASS  strategies  including  the  discovery  of  features  of  the  cost  surface.  Both  Kamop^^  and 
Rao^^  generate  a  random  walk  sequence  to  an  extremum  by  perturbing  the  search  point  with  a  uniform  random  variable. 
Rao  exploits  the  directionality  of  the  randomly  generated  vectors  which  continue  to  yield  lower  valued  objective  functions. 
In  a  similar  algorithmic  formulation  as  Rao,  Matyas^^  utilizes  Gaussian  perturbations  about  the  search  point.  Solis  & 
Wetts^^  have  upgraded  Matyas'  random  optimization  approach  by  incorporating  a  bias  in  the  mutation  operator  and 
evaluating  the  objective  function  at  x-5jc  if  evaluation  at  x+8x  does  not  improve  the  current  value  of  the  objective  function. 
Baba^^  has  successfully  applied  this  SASS  technique  to  training  static  networks.  Simulated  annealing^^  is  also  another 
example  of  SASS.  The  probabilistic  nature  of  setting  x=x-t-5jt  if  /fx-t-Sxj  >  fix)  provides  a  hill-climbing  capability  not 
contained  in  the  methods  previously  discussed. 

The  basic  algorithm  formulated  by  Solis  &  Wets^^  is  described  as  follows 

1.  Initialize  the  search  vector Xq.  setk  =  0 and b  =  0. 

2.  Generate  a  Gaussian  random  vector  ~N(b,o). 

3(a).  IffiXj^  +  <fix^),  then  setx^^j  =  Xj^  -h  and  b^^i  =  +  O.Zb^. 

3(b).  Iffix^  -  f,^)  <fix^)  <fix^  +  then  setx^^,  =x^-^  and  b^^,  =  b^  -  0.4i,^. 

3(c).  Otherwise,  Xj^^,  =  x,^  and  =  O.Sb/^ 

4.  Ifk  =  maximum  number  of  iterations  then  stop,  else  k=k+l  and  go  to  Step  2. 

In  the  basic  algorithm,  the  variance  on  ^  is  controlled  by  the  repeated  number  of  successes  or  failures  in  decreasing  the 
objective  function /.  The  contraction  and  expansion  constants,  as  well  as  the  upper  limits  on  the  allowable  number  of  uials, 
are  set  by  the  user. 

This  technique  was  modified  so  that  the  variance  of  the  Gaussian  perturbations  are  proportional  to  the  magnitude 
of  the  objective  function  ^  -  N(b,  fix)).  This  optimization  process  was  applied  to  the  Bohachevsky  function 

/(x)  =  xf +2x; -0.3cos(3tu,)-0.4cos(47u:2)  +  0.7.  The  transcendental  terms  cause  many  local  minima  in.side  the 
interval  xe[-I,l|  while  the  quadratic  terms  dominate  the  surface  structure  outside  of  this  interval  A  global  minimum 
exists  at  x=(0,0).  The  modified  Solis  &  Wets  optimization  technique  was  used  for  finding  the  global  minimum  on  the 
Bohachevsky  surface  as  shown  in  Fig.  1  for  the  average  of  ten  .sessions.  It  is  interesting  to  note  that  an  average  of  1.^0 
function  evaluations  were  conducted  for  each  session.  As  a  comparison,  the  unmodified  version  of  the  basic  algoriilim  was 
used  to  find  the  global  minimum  of  the  Bohachevsky  function  with  the  average  results  of  ten  sessions  shown  in  Fig.  2.  On 
average,  each  optimization  session  contained  189  function  evaluations.  It  should  be  noted  that  in  one  instance,  the 
algorithm  became  dapped  at  a  local  minima.  The  variance  parameters  are  the  same  as  tho.se  given  by  .Solis  &  Wets-*’. 


1 
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Fig.  I.  The  modified  Solis  &  Wets  technique  applied  Fig.  2.  The  unmodified  Solis  &  Wets  technique 

to  finding  the  global  minimum  of  the  Bohachevsky  applied  to  finding  the  global  minimum  of  the 

funcntion.  Averaged  over  ten  trials.  Bohachevsky  function.  Averaged  over  ten  trials. 

The  power  of  the  unmodified  Solis  &  Wets  technique  aj^iears  self-evident  as  illustrated  by  Figures  1  and  2  where 
log(J)  is  the  logjfffix)).  The  observation  might  be  made  that  reducing  the  variance  independent  of  the  objective  function  is 
beneficial  to  the  search.  It  should  be  noted  that  the  lower  bound  on  the  standard  deviation  of  the  random  perturbation  for 
the  unmodified  method  was  set  at  =  KT*.  Higher  precision  might  have  been  attained  (on  average)  at  the  expense  of 
more  frequent  entrapment  in  local  minima.  To  avoid  entrapment  conditions  it  is  suggested  that  a  global  search  agent  be 
employed  in  parallel  with  the  local  method. 

2.2.  The  Evolutionary  Programming  Paradigm 

In  1958,  Brooks^^  described  a  creeping  random  method  where  k  points  were  generated  via  Gaussian  perturbations 
about  a  search  point.  The  best  point  was  kept  and  the  process  repeated.  Brooks  observed  that  "there  are  some  rather 
intriguing  analogies  that  can  be  made  between  the  creeping  random  method  and  evolution".  This  analogy  was  also 
apparent  to  Fogel  et  al.^  who  proposed  a  multi-agent  search  strategy  incorporating  a  population  of  organisms  that  are 
mutated  to  yield  offspring.  The  resulting  search  strategy  was  termed  evolutionary  programming  (EP). 

EP  is  a  multi-agent  stochastic  search  (MASS)  paradigm  used  for  finding  global  extrema.  Although  the  EP 
methodology  simulates  the  evolutionary  process  found  in  nature,  the  mechanisms  incorporated  in  this  framework  and 
resulting  characteristics  may  also  be  found  in  some  of  the  stochastic  optimization  techniques  previously  discussed.  The 
perturbation  applied  in  the  search  is  typically  a  multivariate  normal  random  variable  5x  ~N(0,  Sj^-  J)  where  Sj  is  the  .scale 
factor  and  J  is  the  magnitude  of  the  objective  function.  Hill-climbing  and  tunneling  are  achieved  though  the  application  of 
the  mutation  operator  in  concert  with  the  multi-agent  capabilities  of  EP.  Analogous  to  the  hill-climbing  ability  of  simulated 
annealing  relaxation  methods,  EP  employs  a  competition  process  which  allows  less  fit  organisms,  (.search  points)  to  be 
retained  in  the  population  in  a  probabilistic  fashion.  The  EP  optimization  algorithm  can  be  described  by  the  following 
steps^ 


/.  Form  an  initial  population  P2N-1M  of  size  2N.  The  parameters  x  associated  with  parent  element  F,  are  randomly 
initialized  from  a  user  specified  search  domain. 

2.  Assign  a  cost  to  each  element  Pfx)  in  the  population  based  on  the  associated  objective  function  J,. 

3.  Reorder  the  population  based  on  the  number  of  wins  generated  from  a  stochastic  competition  process. 

4.  Generate  offspring  (Pfj ....  P^s.i)  from  the  highest  ranked  N  elements  (Pq  ....  P/^  f  in  the  population  hr  perturbing  x 
with  a  multivariate  normal  random  variable  bx  ~N(0,  Sj-  J). 

5.  Loop  to  step  2. 


Fig.  3.  Optimization  of  the  Bohackevsky  function  using  Fig.  4.  Augmenting  EP  with  bisection  search.  Half  of 

the  EP  algorithm,  Sri,  so  parents,  20  competitions.  offspring  were  generated  by  mutation,  the  other 

half  by  recombination. 

The  EP  algorithm  was  applied  to  the  Bohachevsky  function  with  the  results  shown  in  Fig.  j  Fnr  these 
experiments,  the  scale  factor  was  set  as  5y-/  for  50  parents  with  one  offspring  each  and  ten  runs  were  made.  There  were  20 
competitions  held  for  each  member  of  the  population.  A  bisection  search  was  implemented  by  allowing  randomly  chosen 
parents  to  recombine  according  to  =  0.5(x,  +  x^  where  x^  and  Xj  are  the  randomly  chosen  parent  vectors  and  Xg  is  the 
offspring  vector.  This  was  done  for  half  the  offspimg  while  the  other  half  wore  generated  using  the  perturbation  approach 
8x  ~N(0,  J).  Nearly  an  order  of  magnitude  improvement  was  observed  as  shown  in  Fig.  4.  The  next  experiment  decoupled 
the  cost  function  from  the  perturbation  size  so  that  8x  ••N(0, 1).  An  order  of  magnitude  improvement  was  observed  as  shown 
in  Fig.  4.  Fogel^  reports  that  it  took  an  average  of  65.5  generations  to  achieve  logjfffix))  <  using  the  same  number  of 
parents  and  offspring  in  the  general  EP  method.  By  decoupling  the  cost  function  and  implementing  a  simple  bisection 
search,  it  took  less  than  30  generations,  on  average,  to  achieve  similar  results. 

2.3.  A  Hybrid  Approach 

A  variant  of  the  EP  search  strategy  is  proposed  to  take  advantage  of  the  efficiency  of  convex  optimization  (such  as 
the  bisection  search)  as  well  as  the  global  search  capability  provided  by  MASS  strategies.  From  the  previous  experiments 
the  following  capabilities  appear  to  be  beneficial  to  a  hybrid  approach:  multi-agent  search  tends  to  avoid  local  minima;  if 
the  offspring  does  not  yield  a  lower  cost,  then  check  in  the  other  direction;  convex  optimizaton  and  perturbation  variance 
reduction  improves  precision.  Making  the  perturbation  vari?"ce  proportional  to  the  value  of  the  cost  function  does  not 
appear  to  provide  significandy  better  results  and  may  even  inhibit  the  rate  of  convergence  with  respect  to  other  approaches. 
Decoupling  the  perturbation  variance  from  the  cost  function  value  may  prove  beneficial  since  oftentimes  the  shape  of  the 
search  surface  is  not  well  known  and  may  even  take  on  negative  values.  A  similar  strategy  was  employed  by  Waagen  et 
al.^°  in  the  formulation  of  the  stochastic  direction  set  method. 

Fig .  5  illustrates  a  hybrid  approach  which  generates  a  variety  of  different  offspring  within  the  EP  framework.  The 
first  set  of  offspring  are  generated  to  explore  the  search  space  in  a  global  fashion.  These  offspring  mutate  the  parent  search 
points  with  the  perturbation  5x  ~N(0,o)  where  a  is  fixed.  The  second  set  of  offspring  implement  convex  optimization 
through  recombination.  This  may  be  as  simple  as  the  bisection  method  used  above  The  final  set  of  offspring  are  generated 
using  the  method  of  Solis  &  Wets.  This  set  of  offspring  replace  the  parent  organisms  since  they  are  equivalent  to  or  better 
than  their  precursors.  The  bias  term  which  provides  a  momentum  to  the  search  must  also  be  included.  Offspring  generated 
by  the  first  method  have  the  bias  term  set  to  zero.  The  recombination  rules  which  apply  to  the  second  set  of  offspring  can 
also  be  applied  to  other  parameters  including  the  bias  term  p. 


New  Population 


Fig.  5.  One  generation  of  the  hybrid  multi-agent  stochastic  search. 


A  variety  of  other  techniques  may  be  employed  as  alternatives  to  the  stochastic  competition  process.  Saravanan*^ 
replaces  the  least  fit  organism  in  the  population  with  each  additional  offspring.  Another  strategy  might  be  to  retain  less  fit 
organisms  by  disallowing  competition  across  different  levels  of  maturity.  The  maturity  of  an  organism  could  be  determined 
by  how  many  generations  it  has  existed  within  the  population.  A  deterministic  means  to  minimize  redundancy  might  also 
be  employed.  To  delay  the  potential  dominance  of  a  single  organism  in  the  population,  the  number  of  competitions  can  be 
set  to  an  arbitrarily  low  value.  This  allows  the  retention  of  less  fit  organisms,  thus  providing  a  more  exhaustive  search.  As 
the  number  of  competitions  increases,  the  retention  of  the  best  fit  individuals  becomes  more  deterministic. 

The  hybrid  tqrproach  was  employed  on  the  Bohachevsky  function  with  the  average  cost  from  ten  uials  shown  in 
Fig.  6.  Fcff  this  particular  example,  it  appears  that  the  hybrid  technique  imim>ves  the  efficiency  of  the  search  from  both 
precision  and  convergence  aspects.  As  shown  in  Fig.  6,  the  hybrid  technique  attained  10  orders  of  magnitude  precision 
within  50  generations  (which  corresponds  to  a  maximum  of  7550  function  evaluations).  A  comparison  between 
Solis&Wets'  technique  and  the  hybrid  approach  which  utilizes  their  method  was  also  made  for  the  Rosenbrock  function. 

This  function  is  referred  to  as  a  banana  valley  since  it  contains  a  steep  valley  along  Xj  =  Xi .  The  Rosenbrock  function  is 

given  by  /(x)  =  l00(Xj  -X2f  +(l-Xif.  Fig.  7  shows  the  average  and  best  results  from  ten  trials  for  both  the  Solis  & 
Wets  technique  and  the  hybrid  approach  outlined  above.  To  compare  the  search  processes  based  on  the  number  of  function 
evaluations,  each  generation  is  equivalent  to  a  maximum  of  150  function  evaluations  fa*  the  Solis  &  Wets  method  and  a 
maximum  of  150  function  evaluations  for  the  hybrid  approach.  Both  average  curves  show  optimization  is  still  occurring 
after  the  maximum  number  of  generations  or  ito^tions  were  reached  and  the  experiment  stopped. 


3.  EVOLVING  PERCEPTRONS 


This  section  investigates  the  application  of  EP  to  evolving  a  recursive  adaptive  linear  combiner‘s  with  a  nonlinear 
output  or  activation  function.  The  neuron  model  will  serve  as  a  next-step  predictor  and  is  given  by 


+  =  f 


>n-l  fi-l 

^a,y(k-i)  +  ^b^y(k-  j  +  l) 


1=0 


/=! 


where  the  search  strategy  must  determine  the  order  of  the  feedforward  terms,  m,  the  order  of  the  feedback  terms,  n,  as  well 
as  the  feedforward  coefficients,  a,-,  and  the  feedback  coefficients,  bj.  This  structure  is  shown  in  Fig.  8. 

The  sigmoid  function  is  a  candidate  for  the  nonlinear  mapping  since  it  approximates  the  perceptron's  output  as  an 
inverted  polynomial  series 


{ 

1-1- 

V  '“to 


(-a:)' 


•  J 


Other  nonlinear  functions  may  be  equally  applicable  as  discussed  by  Cybenko^‘.  For  example,  a  polynomial  series  is 
generated  by 

y? 

/(x)  =  ox(x)+sin(x)‘=l  +  x - -t- — + — ... 

2!  3!  4!  5! 

or,  the  search  may  even  be  conducted  over  a  set  of  candidate  mapping  functions  F  such  that  /e  F. 

The  cost  function  for  each  perceptron  is  given  by  the  Akaike  information  criterion  (AIC)^^ 


//C  ( m, /I )  =  iV  In  ( ) -1- 2(  m -I- n -F 1 ) 

where  N  is  the  effective  number  of  observations.  An  additional  factor  of  2  is  added  to  the  number  of  fitted  parameters  (m+n- 
1)  since  m  and  n  must  also  be  determined  by  the  search  strategy.  If  a  bias  term  bQ  is  added  to  the  perceptron.  then  the 
number  of  fitted  parameters  must  be  increased  to  {m+n+2).  The  MLE  of  the  innovation  variance  is  determined  according  to 


If  it  is  desired  that  direct  linear  feedthrough  (DLF)^^  capabilities  be  present  in  parallel  with  the  nonlinear 
contributions,  then  the  perceptron  structure  can  be  reformulated  as  a  combination  of  linear  and  nonlinear  recurrencies.  This 
idea  is  express  by 


y(,k  +  l)=yi^{k  +  \)+%{k  +  l) 


m-\  n-\ 

+ 1)  =  ^a,y{k  -  i)  +  '^b-yik  -j  +  l) 

i=0  j=l 


where 


Fig.  8.  The  recurrent  perceptron  structure  used  for 
next-step  prediction. 


Fig.  9.  The  parallel  linear-nonlinear  recurrent 
structure  used  for  next-step  prediction. 
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y^(/:  +  l)  =  /  '^cjik-i)  +  '^djy(k-  j  +  l) 
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If  this  structure  is  used,  thert  the  AIC  score  is  described  by 

AIC{m,n,p,q)  =  N]n(al)  +  2(m+  n  +  p  +  q  +  2) 

The  number  of  fitted  parameters  is  (m+n+p+q+4)  if  both  and  biases  are  included.  The  parallel  linear-nonlinear 
structure  is  shown  above  in  Fig.  9. 

4.  PREDICTION  RESULTS 

4.1.  Predicting  Sunspots 

The  first  set  of  experiments  were  conducted  on  Wolfs  sunspot  series  over  the  years  1700-1983.  These  numbers  are 
indicative  of  the  average  relative  number  sunspots  observed  eacn  day  of  the  year  .  Consistent  with  Weigend  el  sunspot 
data  from  1700-1920  was  used  for  training  and  data  from  1921-1983  was  used  fw  testing.  All  of  the  data  was  normalized 
by  a  factor  of  200.  A  small  number  of  experiments  were  run  for  1000  generations  with  20  parents,  20  offspring  (10  for 
convex  optimization  and  10  for  global  mutation),  10  competitions,  and  P|^  =  0.01 .  The  maximum  possible  model  order  was 
equivalent  for  both  the  feedforward  and  feedback  lines  'n,^=n,^=p^=q^=2I .  The  results  are  given  in  Table  1.  Fven 
though  the  AIC  criterion  is  used,  some  degree  of  overfitting  may  still  occur  as  indicated  by  the  sepond  entry  in  Table  I. 
This  experiment  yielded  the  lowest  MSE  on  the  training  set  but  did  not  generalize  as  well  as  oiiici  lesuli*  on  the  test  set. 

The  solutions  which  yielded  the  best  combined  test+training  AIC  scores  are  shown  in  Figs.  10- 13.  Figs.  10  and  1 1 
.show  the  training  and  test  results,  respectively,  for  a  parallel  linear-hyperbolic  tangent  8-3-10-3  implementation.  A  linear 
bias  was  also  incorporated  for  this  trial.  Figs.  12  and  13  show  the  training  and  test  results,  respectively,  for  a  simple  linear 
implementation  of  Fig.  8.  Again,  a  bias  term  was  incorporated  in  this  experiment.  It  is  interesting  to  note  that  the  evolved 
structure  for  the  linear  system  did  not  include  recurrency  and  relies  only  on  the  three  previous  observations  and  a  bias  term 
Another  interesting  observation  is  that  there  is  only  a  slight  difference  between  the  resulting  prediction  curves  shown  in 
Figs.  11  and  13.  The  better  training+test  AIC  scores  are  comparable  to  the  AIC  .scores  given  by  PriesUey^'^  for  AR  and 
ARMA  process  models.  As  another  comparison,  a  pure  follower  strategy  where  the  next-step  prediction  is  just  the  current 
value  yields  an  AIC=  -1 192.5  on  the  first  278  data  point-s. 


functions 


linear+bias+sin 


linear+bias+sin 


linear+bias+tanh 


linear+bias+ 
tanh(w/  bias) 


linear+bias 


m-n-p-q 


6-5-11-4 


4-4-2-2 


8-3-10-3 


-5 


3-0-x  X 


0.00508 


0.00470 


0.00498 


0.00489 


0.00574 


0.00588 


0.00678 


0.00566 


0.00597 


0.00686 


Table  1.  Results  from  sample  experiments  evolving  next-step  sunspot  predictors. 


Fig.  12.  The  sunspot  training  results  for  f=linear+bias  Fig.  13.  The  sunspot  test  results  for  f=linear+hias  with 


with  structure  3-0. 


structure  3-0. 


4.2.  Th«  Iterated  Quadratic  Map 


Weigend  el  alV  use  the  iterated  quadratic  map  tt+i  =4x^(1 -x^)  on  die  unit  interval  as  an  example  ol 
deterministic  chaos.  I'l .  ease  in  (Xiint  precludes  any  long-term  predictability  although  short  term  predictions  are  ptissible 
Weigend  ei  at.^'  ir.u  ate  that,  on  average,  predictive  performance  is  lost  after  n  iterations  where  n  is  the  number  of  hits 
representing  X;  .his  problem  is  essentially  an  interpolation  in  state  space  as  demonstrated  by  Fig.  14  Ihe  same 
parameters  were  used  as  in  the  previous  experiment.  Fig.  15  shows  the  performance  of  the  evolved  predictor 
Xjt^..  -  sin(11476Xi)-^ 0.0274  on  the  naining  set  after  5000  training  generations.  This  model  yields  an  AIC  score  of  -1033 
and  an  MSfi=0.00533  on  the  training+test  data.  This  solution  was  evolved  from  a  parallel  arrangement  where  one  nixie 
implements  the  cos  mapping  and  the  parallel  node  implements  the  sin  mapping.  The  state-space  plot  in  Fig  14  designates 
the  evolved  solution  as  marked  by  the  '*’s.  The  test  results  are  shown  in  Fig.  16.  Fig  17  shows  the  test  results  if  the 

predictor  takes  the  form  x*^,  =  sinfru* ).  The  state-space  plot  of  this  solution  is  designated  by  the  o's  in  Fig  14  As  in  the 
previous  example,  a  solution  was  found  which  did  not  rely  on  recurrency  to  approximate  the  time-senes. 


Fig.  14.  State-space  plot  of  observed  data  (+),  evolved  Fig.  IS.  Training  data  and  the  results  generated  from 

solution  (*),  and  an  estimate  (o)  generated  by  inspection  the  evolved  solution  Xt  ^,  =sin(3.1476Xj)  + 0.0274. 

of  the  evolved  solution. 


k 


Fig.  16.  Test  data  and  the  evolved  solution  of  the  form 

sin(  3. 1476x*)  + 0.0274. 


ranom 


Fig.  17.  Test  data  and  the  estimate  x,^,  =  sinfitr, ) 
generated  from  inspection  of  the  evolved  solution. 


4.3.  The  Mackey-Glass  Equation 


The  Mackey-Glass  equation  represents  a  model  lor  white  blood  cell  production  in  leukemia  patients'’  I  his  model 
is  complicated  by  the  addition  of  a  lime  delay  t  in  the  nonlinear  differential  equation 


^  ax(t-T)  .  ,  ^ 
-r(G  =  ■; - - -  -  bxii ) 


where  the  free  parameters  are  set  as  a=0.2,  b=0.1,  c=IO.  and  "1=30  according  to  Jones  et  .  In  all  experiments  500  data 
points  were  in  the  training  set  with  the  test  set  comprising  the  subsequent  500  data  points.  All  of  the  data  was  normalized 
by  a  factor  of  14  The  experiments  also  used  the  parallel  sin-cos  nodal  arrangement  with  a  bias  as  used  in  the  previous 
example.  Fig.  18  shows  the  training  data  and  the  results  from  a  5-7-3-3  configuration.  The  uaining  set  trained  to  a 
MSE=0.0005  and  AIC=-3609  after  5000  generations.  The  test-i-training  set  have  a  MSE=0.00028  and  AIC=-7968.  Fig.  19 
shows  the  prediction  capabilities  of  this  model  on  the  test  set.  A  more  challenging  section  of  the  Mackcy-Glass  equation 
was  evaluated  as  shown  in  Fig.  20.  A  new  model  was  generated  and,  for  1000  generations  of  training,  a  MSE=0.00109  and 
A1C=-3181  were  attained  on  the  first  500  points  with  a  10-7-8-8  configuration  structure.  For  this  model,  a  MSE=0.00066 
and  AIC=-7089  were  achieved  for  the  full  sequence  of  test-i-training  data.  The  error  for  the  test  and  training  sequences  is 
shown  in  Fig.  21. 


Fig.  18.  The  training  results  for  a  5-7-J-J  configuration.  Fig.  19.  The  test  results  for  the  5-7-3- J  configuration. 


Fig.  20.  The  training  and  test  results  for  a  10-7-8-8  Fig.  21.  The  error  for  the  sequence  shown  in  Fig.  20. 


configuration. 


5.  CONCLLSION 


This  work  has  incorporated  an  efficient  single-agent  search  strategy,  the  method  of  Sohs&Wets,  into  the  EP 
framework  and  augmented  this  with  convex  optimization  capabilities  to  yield  a  hybrid  multi-agent  stochastic  search 
technique.  The  method  was  applied  to  parallel  linear-nonlinear  HR  filters  for  next-step  prediction  tasks.  The  evolved 
solutions  did  not  always  have  a  recurrent  structure  and,  as  a  result,  simple  implementations  were  found  using  this  approach. 
Oftentimes,  simplicity  is  not  an  option  when  a  large  non-dynamic  architecture  is  specified  for  a  given  task.  The  lack-of- 
complexity  of  the  evolved  solutions  for  the  standard  problem  set  investigated  in  this  work  shows  great  promise  for  future 
implementations.  It  remains  to  be  investigated  as  to  how  well  this  approach  scales  up  to  more  complex  data  sets.  The 
incorporation  of  the  output  transfer  function  into  the  search  would  be  the  next  evolution  of  this  work. 

The  simple  recurrent  structures  investigated  in  this  woik  demonstrated  a  high  degree  of  capability  frw  the  time- 
series  problems  studied.  Similar  levels  of  proficiency  were  attained  for  a  varied  assortment  of  models.  This  might  lead  one 
to  conclude  that  the  joint  parameter-function  space  is  dense  in  the  number  of  possible  solutions,  some  of  which  are  found  by 
the  relaxation  scheme  employed  in  this  investigation.  By  cascading  and/or  parallelizing  as  in  traditional  feedforward  neural 
network  architectures,  the  recunent  nodes  may  present  additional  capabilities  for  more  complex  time-series  processing 
tasks. 
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